- North America > Canada > Alberta (0.04)
- Europe > Bulgaria (0.04)
- North America > United States > Kentucky (0.04)
- (10 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology (0.92)
- Media (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG
Bang, Jihwan, Lee, Juntae, Yang, Seunghan, Choi, Sungha
Multi-hop retrieval-augmented generation (RAG) is a promising strategy for complex reasoning, yet existing iterative prompting approaches remain inefficient. They often regenerate predictable token sequences at every step and rely on stochastic stopping, leading to excessive token usage and unstable termination. We propose TSSS (Think Straight, Stop Smart), a structured multi-hop RAG framework designed for efficiency. TSSS introduces (i) template-based reasoning, which caches recurring prefixes and anchors sub-queries to the main question, reducing token generation cost while promoting stable reasoning, and (ii) a retriever-based terminator, which deterministically halts reasoning once additional sub-queries collapse into repetition. This separation of structured reasoning and termination control enables both faster inference and more reliable answers. On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT approaches, highlighting its effectiveness in efficiency-constrained scenarios such as on-device inference.
- Telecommunications (0.41)
- Semiconductors & Electronics (0.41)
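The retriever-based terminator that the TSSS abstract describes (halt once new sub-queries "collapse into repetition") could be sketched as follows. This is our own illustrative guess at such a mechanism, not the authors' implementation; the overlap measure and threshold are assumptions.

```python
# Hypothetical sketch of a retriever-based terminator in the spirit of TSSS:
# stop iterating once a new sub-query retrieves (nearly) the same documents
# as a previous hop, signalling that reasoning has collapsed into repetition.

def jaccard(a, b):
    """Overlap between two sets of retrieved document IDs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def should_stop(history, new_doc_ids, threshold=0.8):
    """Deterministic halt: the new hop adds no fresh evidence."""
    return any(jaccard(prev, new_doc_ids) >= threshold for prev in history)

# Toy usage with invented document-ID lists:
history = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
print(should_stop(history, ["d7", "d8"]))        # disjoint hops -> keep going
print(should_stop(history, ["d1", "d2", "d3"]))  # repeated hop  -> halt
```

Because the decision is a set comparison rather than a sampled token, termination is deterministic for a fixed retriever, which matches the stability the abstract emphasizes.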
We thank the Reviewers for their thoughtful assessment of our work and valuable comments. We will work on improving the writing for the final version, as suggested. The test can naturally be applied at any point of the training process to see if overfitting has happened. We used different random seeds for each training process. Indeed, hyperparameter selection is one of the potential sources of overfitting.
Given the time- and space-bounded aspects of the rebuttal, we hope we clarified the main questions of the reviewers.
We thank the four reviewers for their insightful comments and suggestions. "I looked into the paper in ref [12] ...": In [12], the greedy algorithm is generic, with no assumptions about the models. "...": Random search leads to a set of ... For Tab. 1, we ran the Wilcoxon signed-rank test (paired along settings, datasets, and model types). For Tab. 2 (with more costly experiments), we do not have enough runs to apply such a test; we nonetheless report the standard errors in the paper, which seem to indicate significant improvements. "...": Those numbers indicate the size of the ensemble; we will clarify this point. "...": We thank R1 for the idea and ran our entire benchmark for ResNet-20. "...": Hyper ensembles can indeed be viewed as a mixture. They typically use Bayesian nonparametric priors/posteriors and MCMC; we use mixtures and SGD. "...": When used with replacement, the greedy algorithm from Caruana et al. [12, Sec. ...]
ISCA: A Framework for Interview-Style Conversational Agents
Welch, Charles, Lahnala, Allison, Varadarajan, Vasudha, Flek, Lucie, Mihalcea, Rada, Boyd, J. Lomax, Sedoc, João
We present a low-compute non-generative system for implementing interview-style conversational agents which can be used to facilitate qualitative data collection through controlled interactions and quantitative analysis. Use cases include applications to tracking attitude formation or behavior change, where control or standardization over the conversational flow is desired. We show how our system can be easily adjusted through an online administrative panel to create new interviews, making the tool accessible without coding. Two case studies are presented as example applications, one regarding the Expressive Interviewing system for COVID-19 and the other a semi-structured interview to survey public opinion on emerging neurotechnology. Our code is open-source, allowing others to build off of our work and develop extensions for additional functionality.
- North America > United States > Michigan (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Europe > Bulgaria (0.04)
- (9 more...)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Personal > Interview (1.00)
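The ISCA abstract describes a low-compute, non-generative agent whose conversational flow is fixed by configuration rather than produced by a model. A scripted flow like that could be sketched as a small state machine; the schema and prompts below are invented for illustration and are not ISCA's actual format.

```python
# Hypothetical sketch of a non-generative, scripted interview flow in the
# spirit of ISCA: prompts and transitions live in a config, so the agent
# never generates free text and the flow is fully standardized.

FLOW = {
    "start":    {"prompt": "How are you feeling today?",       "next": "followup"},
    "followup": {"prompt": "What has been on your mind most?", "next": "end"},
    "end":      {"prompt": "Thank you for participating.",     "next": None},
}

def run_interview(flow, answers):
    """Walk the scripted flow, logging each (prompt, answer) pair."""
    state, transcript = "start", []
    replies = iter(answers)
    while state is not None:
        node = flow[state]
        # The closing node takes no answer; earlier nodes consume one reply each.
        reply = next(replies, "") if node["next"] else None
        transcript.append((node["prompt"], reply))
        state = node["next"]
    return transcript

log = run_interview(FLOW, ["Fine.", "Work."])
print(len(log))  # 3 turns; the closing turn has no answer
```

Editing `FLOW` rather than code is what would make an online administrative panel (as the paper describes) sufficient to create new interviews.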
939314105ce8701e67489642ef4d49e8-AuthorFeedback.pdf
We answer your main questions as follows. "Is there any hope to avoid the ...": We will add a remark in the paper to discuss this point more thoroughly. Question 2. "Technically, I think in order for Lemma 4 to hold, f needs to be defined on the whole vector space": The issue has also been identified by Reviewer #3; we will improve the paper's writing to make this point clearer. Question 2. "What regret ... if ... only access to 1 gradient query per step, rather than the two used in OEGD?" We address your main questions as follows. Question 1. "How would the lower bound of the function appear in your bounds if we assume they are not positive?" Question 2. "How would the algorithms / results change if 0 is not in X?" Answer 2. There are three places we use this assumption: ... About the self-bounding property of smooth functions, you are absolutely correct. For other minor issues, we will carefully revise the paper according to your constructive comments. Below we address your concerns and clarify the misunderstandings. Question 2. "The novelty of the paper is limited. ..."
RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering
He, Bolei, He, Xinran, Chen, Mengke, Xue, Xianwei, Zhu, Ying, Ling, Zhenhua
Large Language Models (LLMs) excel in many areas but continue to face challenges with complex reasoning tasks such as Multi-Hop Question Answering (MHQA). MHQA requires integrating evidence from diverse sources while managing intricate logical dependencies, which often leads to errors in reasoning. Retrieval-Augmented Generation (RAG), widely employed in MHQA tasks, struggles to effectively filter noisy data and retrieve all necessary evidence, limiting its effectiveness on MHQA. To address these challenges, we propose RISE: Reasoning Enhancement via Iterative Self-Exploration, a novel framework designed to enhance models' reasoning capability through iterative self-exploration. Specifically, RISE involves three key steps in addressing MHQA tasks: question decomposition, retrieve-then-read, and self-critique. By leveraging continuous self-exploration, RISE identifies accurate reasoning paths, iteratively self-improving the model's capability to integrate evidence, maintain logical consistency, and enhance performance on MHQA tasks. Extensive experiments on multiple MHQA benchmarks demonstrate that RISE significantly improves reasoning accuracy and task performance.
- North America > United States (0.14)
- Asia > Singapore (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (6 more...)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
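The three-step loop the RISE abstract names (question decomposition, retrieve-then-read, self-critique) could be sketched as the control flow below. All function names are placeholders we introduce for illustration; this is not the authors' code.

```python
# Hypothetical sketch of RISE's iterative loop: decompose the question,
# retrieve-then-read evidence per sub-question, then self-critique the
# candidate answer and repeat until the reasoning path is accepted.

def rise_answer(question, decompose, retrieve, read, critique, max_rounds=3):
    notes = []  # evidence accumulated across rounds
    answer = None
    for _ in range(max_rounds):
        for sub_q in decompose(question, notes):    # 1. question decomposition
            passages = retrieve(sub_q)              # 2a. retrieve
            notes.append(read(sub_q, passages))     # 2b. read
        answer, accepted = critique(question, notes)  # 3. self-critique
        if accepted:                                # accurate reasoning path found
            return answer
    return answer  # best effort after max_rounds

# Toy stubs to exercise the control flow:
decompose = lambda q, notes: ["who?", "where?"] if not notes else []
retrieve  = lambda sq: ["passage about " + sq]
read      = lambda sq, ps: ps[0]
critique  = lambda q, notes: ("final answer", len(notes) >= 2)
print(rise_answer("toy question", decompose, retrieve, read, critique))
```

The self-critique step doubles as the loop's stopping rule, which is how "continuous self-exploration" in the abstract can terminate once a consistent path is found.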
MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge
He, Jie, Hu, Nan, Long, Wanqiu, Chen, Jiaoyan, Pan, Jeff Z.
Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks but face significant challenges with complex, knowledge-intensive multi-hop queries, particularly those involving new or long-tail knowledge. Existing benchmarks often fail to fully address these challenges. To bridge this gap, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a comprehensive benchmark to evaluate LLMs' capabilities in multi-hop reasoning across four critical dimensions: question handling strategy, sub-question generation, retrieval-augmented generation, and iterative or dynamic decomposition and retrieval. MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge, with each question equipped with corresponding sub-questions and answers. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries, particularly those involving new or unpopular knowledge. Our findings highlight critical challenges and offer insights for advancing multi-hop reasoning capabilities. The MINTQA benchmark is available at https://github.com/probe2/multi-hop/.
- North America > United States > New York > New York County > New York City (0.14)
- Europe > Spain > Galicia > Madrid (0.04)
- South America > Peru > Arequipa Department > Arequipa Province > Arequipa (0.04)
- (32 more...)
- Leisure & Entertainment > Sports (1.00)
- Health & Medicine (0.68)
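Scoring a model on question-answer pairs like MINTQA's could be sketched as below. The record field names (`"question"`, `"answer"`) are our assumption about a generic QA format, not taken from the benchmark's release.

```python
# Hypothetical sketch of exact-match scoring over MINTQA-style QA records;
# the field names are illustrative assumptions, not the released schema.

def exact_match(prediction, gold):
    """Whitespace- and case-insensitive string equality."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) == norm(gold)

def evaluate(records, model):
    """Fraction of records whose prediction exactly matches the gold answer."""
    hits = sum(exact_match(model(r["question"]), r["answer"]) for r in records)
    return hits / len(records) if records else 0.0

# Toy run with a trivial constant "model":
records = [{"question": "2+2?", "answer": "4"},
           {"question": "capital of France?", "answer": "Paris"}]
print(evaluate(records, lambda q: "4"))  # 0.5
```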
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
Ngo, Nghia Trung, Van Nguyen, Chien, Dernoncourt, Franck, Nguyen, Thien Huu
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
- North America > United States > Oregon (0.04)
- North America > Canada (0.04)
- Europe > Austria (0.04)
- (3 more...)
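A robustness probe of the kind the MedRGB abstract describes (checking whether answers survive noise in the retrieved documents) could be sketched as follows. Everything here, including the function names and the noise-mixing scheme, is our own illustrative assumption rather than the benchmark's implementation.

```python
# Hypothetical sketch of a MedRGB-style robustness probe: mix distractor
# ("noise") passages into the retrieved context and check whether a model
# that answered correctly on clean context still answers correctly.

import random

def build_context(gold_passages, noise_passages, k_noise, seed=0):
    """Combine gold evidence with k_noise sampled distractors, shuffled."""
    rng = random.Random(seed)
    ctx = list(gold_passages) + rng.sample(noise_passages, k_noise)
    rng.shuffle(ctx)
    return ctx

def robustness_drop(model, question, gold, noise, answer, k_noise=2):
    """True iff the model is correct on clean context but breaks under noise."""
    clean = model(question, gold)
    noisy = model(question, build_context(gold, noise, k_noise))
    return (clean == answer) and (noisy != answer)

# Toy model that answers "A" whenever the evidence is present in context:
model = lambda q, ctx: "A" if "evidence" in " ".join(ctx) else "B"
print(robustness_drop(model, "q?", ["evidence"], ["n1", "n2", "n3"], "A"))
```

Aggregating this flag over a dataset would give the kind of noise-sensitivity measurement the paper's robustness scenario targets.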
SportQA: A Benchmark for Sports Understanding in Large Language Models
Xia, Haotian, Yang, Zhengbang, Wang, Yuqing, Tracy, Rhys, Zhao, Yun, Huang, Dongdong, Chen, Zezhi, Zhu, Yan, Wang, Yuan-fang, Shen, Weining
A deep understanding of sports, a field rich in strategic and dynamic content, is crucial for advancing Natural Language Processing (NLP). This holds particular significance in the context of evaluating and advancing Large Language Models (LLMs), given the existing gap in specialized benchmarks. To bridge this gap, we introduce SportQA, a novel benchmark specifically designed for evaluating LLMs in the context of sports understanding. SportQA encompasses over 70,000 multiple-choice questions across three distinct difficulty levels, each targeting different aspects of sports knowledge from basic historical facts to intricate, scenario-based reasoning tasks. We conducted a thorough evaluation of prevalent LLMs, mainly utilizing few-shot learning paradigms supplemented by chain-of-thought (CoT) prompting. Our results reveal that while LLMs exhibit competent performance in basic sports knowledge, they struggle with more complex, scenario-based sports reasoning, lagging behind human expertise. The introduction of SportQA marks a significant step forward in NLP, offering a tool for assessing and enhancing sports understanding in LLMs.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > Santa Barbara County > Santa Barbara (0.14)
- North America > United States > California > Orange County > Irvine (0.14)
- (7 more...)
- Leisure & Entertainment > Sports > Football (1.00)
- Leisure & Entertainment > Sports > Basketball (1.00)
- Education (1.00)
- Leisure & Entertainment > Sports > Soccer (0.93)
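The SportQA evaluation pairs few-shot exemplars with chain-of-thought (CoT) prompting over multiple-choice questions. A prompt builder for that setup could be sketched as below; the exemplar layout and wording are our invention, not the paper's exact prompt.

```python
# Hypothetical sketch of a few-shot CoT prompt builder for multiple-choice
# questions, in the style the SportQA evaluation describes. The template
# strings are illustrative assumptions.

def format_mcq(question, choices):
    """Render a question with lettered options."""
    letters = "ABCD"
    lines = [question] + [f"{l}. {c}" for l, c in zip(letters, choices)]
    return "\n".join(lines)

def build_prompt(exemplars, target_q, target_choices):
    """exemplars: list of (question, choices, rationale, answer_letter)."""
    parts = []
    for q, ch, rationale, ans in exemplars:
        parts.append(f"{format_mcq(q, ch)}\nReasoning: {rationale}\nAnswer: {ans}")
    # The target question ends at "Reasoning:" so the model continues with CoT.
    parts.append(f"{format_mcq(target_q, target_choices)}\nReasoning:")
    return "\n\n".join(parts)

prompt = build_prompt(
    [("Which sport uses a shuttlecock?",
      ["Tennis", "Badminton", "Squash", "Padel"],
      "A shuttlecock is specific to badminton.", "B")],
    "How many players are on the court per basketball team?",
    ["4", "5", "6", "7"],
)
print(prompt.endswith("Reasoning:"))  # True
```

Ending the prompt at "Reasoning:" is what elicits the model's own chain of thought before its answer letter.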